Abstract



We provide a new representation-independent formulation of Occam's razor theo- 
rem, based on Kolmogorov complexity. This new formulation allows us to: (i) Obtain 
better sample complexity than both length-based [4] and VC-based [3] versions of 
Occam's razor theorem, in many applications; and (ii) Achieve a sharper reverse 
of Occam's razor theorem than that of [5]. Specifically, we weaken the assumptions 
made in [5] and extend the reverse to superpolynomial running times. 

Key words: Analysis of algorithms, pac-learning, Kolmogorov complexity, 
Occam's razor-style theorems 



1 Introduction 

Occam's razor theorem as formulated by [3,4] is arguably the substance of 
efficient pac learning. Roughly speaking, it says that in order to (pac-) learn, it 
suffices to compress. A partial reverse, showing the necessity of compression, 
has been proved by Board and Pitt [5]. Since the theorem is about the relation 
between effective compression and pac learning, it is natural to assume that a 
sharper version ensues by couching it in terms of the ultimate limit to effective 
compression which is the Kolmogorov complexity. We present results in that 
direction. 

Despite abundant research generated by its importance, several aspects of 
Occam's razor theorem remain unclear. There are basically two versions. The 
VC dimension-based version of Occam's razor theorem (Theorem 3.1.1 of [3]) 
gives the following upper bound on sample complexity: For a hypothesis space 
$H$ with $\mathrm{VCdim}(H) = d$, $1 < d < \infty$,

The following lower bound was proved by Ehrenfeucht et al. [6]:

$$m(H,\delta,\epsilon) > \max\left(\frac{d-1}{32\epsilon},\ \frac{1}{\epsilon}\ln\frac{1}{\delta}\right) \qquad (2)$$

The upper bound in (1) and the lower bound in (2) differ by a factor $\Theta(\log\frac{1}{\epsilon})$.
It was shown in [8] that this factor is, in a sense, unavoidable.

When H is finite, one can directly obtain the following bound on sample 
complexity for a consistent algorithm: 

$$m(H,\delta,\epsilon) \le \frac{1}{\epsilon}\ln\frac{|H|}{\delta} \qquad (3)$$



For a graded boolean space $H_n$, we have the following relationship between
the VC dimension $d$ of $H_n$ and the cardinality of $H_n$,

$$d \le \log|H_n| \le nd. \qquad (4)$$



When $\log|H_n| = O(d)$ holds, then the sample complexity upper bound given
by (3) can be seen to equal $\frac{1}{\epsilon}(O(d) + \ln\frac{1}{\delta})$ which matches the lower bound
of (2) up to a constant factor, and thus every consistent algorithm achieves 
optimal sample complexity for such hypothesis spaces. 
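
As a rough numerical illustration, the following Python sketch evaluates the two bounds for a finite hypothesis space. It assumes that (3) reads $\frac{1}{\epsilon}\ln\frac{|H|}{\delta}$ and that (2) reads $\max(\frac{d-1}{32\epsilon}, \frac{1}{\epsilon}\ln\frac{1}{\delta})$. The function names and the sample values ($|H| = 2^{20}$, $d = 20$, $\epsilon = 0.1$, $\delta = 0.05$) are hypothetical and chosen only for illustration.

```python
import math

def finite_class_upper_bound(h_size, eps, delta):
    """Bound (3): (1/eps) * ln(|H| / delta) examples suffice for a consistent learner."""
    return math.log(h_size / delta) / eps

def vc_lower_bound(d, eps, delta):
    """Bound (2): max((d - 1) / (32 eps), (1/eps) * ln(1/delta))."""
    return max((d - 1) / (32 * eps), math.log(1 / delta) / eps)

# Hypothetical values: log|H| = d = 20, so log|H| = O(d) holds trivially.
upper = finite_class_upper_bound(2 ** 20, 0.1, 0.05)
lower = vc_lower_bound(20, 0.1, 0.05)
print(f"upper bound (3): {upper:.1f}, lower bound (2): {lower:.1f}")
```

For these values the two bounds differ only by a modest constant factor, in line with the remark above.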

The length-based version of Occam's razor theorem then gives the following
sample complexity $m$ to guarantee that the algorithm pac-learns: For given $\epsilon$
and $\delta$:

$$m = \max\left(\frac{2}{\epsilon}\ln\frac{1}{\delta},\ \left(\frac{(2\ln 2)\,s^{\beta}}{\epsilon}\right)^{1/(1-\alpha)}\right) \qquad (5)$$

This bound is based on the length-based Occam algorithm [3]: A deterministic
algorithm that returns a consistent hypothesis of length at most $m^{\alpha}s^{\beta}$, where
$\alpha < 1$ and $s$ is the length of the target concept.
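
The bound (5) can be checked numerically. The sketch below assumes (5) has the form $\max(\frac{2}{\epsilon}\ln\frac{1}{\delta}, (\frac{(2\ln 2)s^{\beta}}{\epsilon})^{1/(1-\alpha)})$ and verifies that, for the resulting $m$, the usual union bound $2^{m^{\alpha}s^{\beta}}(1-\epsilon)^m$ over hypotheses of length at most $m^{\alpha}s^{\beta}$ is at most $\delta$. The function names and parameter values are hypothetical.

```python
import math

def length_based_sample_size(s, alpha, beta, eps, delta):
    """Evaluate (5) for an Occam algorithm with hypothesis length m^alpha * s^beta."""
    first = 2 / eps * math.log(1 / delta)
    second = (2 * math.log(2) * s ** beta / eps) ** (1 / (1 - alpha))
    return max(first, second)

def log_failure_bound(m, s, alpha, beta, eps):
    """ln of 2^{m^alpha s^beta} * (1 - eps)^m, kept in log space to avoid overflow."""
    return m ** alpha * s ** beta * math.log(2) + m * math.log1p(-eps)

# Hypothetical parameters for illustration.
m = length_based_sample_size(s=100, alpha=0.5, beta=1, eps=0.1, delta=0.05)
print(m, log_failure_bound(m, 100, 0.5, 1, 0.1) <= math.log(0.05))
```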

In summary, the VC dimension based Occam's razor theorem may be hard to 
use and it sometimes does not give the best sample complexity. The length- 
based Occam's razor is more convenient to use and often gives better sample 
complexity in the discrete case. 

However, as we demonstrate below, the fact that the length-based Occam's 
razor theorem sometimes gives inferior sample complexity, can be due to 
the redundant representation format of the concept. We believe Occam's ra- 
zor theorem should be "representation-independent". That is, it should not 
be dependent on accidents of "representation format". (See [16] for other
representation-independence issues.) In fact, the sample complexities given
in (1) and (2) are indeed representation-independent. However they are not
easy to use and do not give optimal sample complexity. Here, we give a
Kolmogorov complexity based Occam's razor theorem. We will demonstrate that
our KC-based Occam's razor theorem is convenient to use (as convenient as the
length based version), gives a better sample complexity than the length based
version, and is representation-independent. In fact, the length based version
can be considered as a specific computable approximation to the KC-based 
Occam's razor. 

As one of the examples, we will demonstrate that the standard trivial learning
algorithm for monomials actually often has a better sample complexity than
the more sophisticated Haussler's greedy algorithm [7]. This is contrary to
the common, but mistaken, belief that Haussler's algorithm is better in all
cases (to be sure, Haussler's method is superior for target monomials of small
length). Another issue related to Occam's razor theorem is the status of the
reverse assertion. Although a partial reverse of Occam's razor theorem has
been proved by [5], it applied only to the case of polynomial running time and
sample complexity. They also required a property of closure under exception
list. This latter requirement, although quite general, excludes some reasonable
concept classes. Our new formulation of Occam's razor theorem allows us to
prove a more general reverse of Occam's razor theorem, allowing arbitrary
running time and weakening the requirement of exception list of [5].


Discussion of Result and Technique: In our approach we obtain bet- 
ter bounds on the sample complexity to learn the representation of a tar- 
get concept in the given representation system. These bounds, however, are 
representation-independent and depend only on the Kolmogorov complexity of 
the target concept. If we don't care about the representation of the hypothesis 
(but that is not the case in this paper) then better "iff Occam style"
characterizations of polynomial time learnability/predictability can be given. They
rely on Schapire's result that "weak learnability" equals "strong learnability" 
in polynomial time [13] exploited in [9]. For a recent survey of the important 
related "boosting" technique see [14]. 

The use of Kolmogorov complexity is to obtain a bound on the size of the
hypothesis class for a fixed (but arbitrary) target concept. Obviously, the
results described can be obtained using other proof methods: all true provable
statements must be provable from the axioms of mathematics by the inference
methods of mathematics. The question is whether a particular proof method
facilitates and guides the proving effort. The message we want to convey is
that thinking in terms of coding and incompressibility suggests improvements
to long-standing results. A survey of the use of the Kolmogorov complexity
method in combinatorics, computational complexity and the analysis of
algorithms is [12], Chapter 6.

1 A preliminary version was presented at the 8th Intn'l Computing and
Combinatorics Conference (COCOON), held in Singapore, August, 2002.



2 Occam's Razor 

Let us assume the usual definitions, say Anthony and Biggs [1], and notation
of [5]. For Kolmogorov complexity we assume the basics of [12].

In the following $\Sigma$ and $\Gamma$ are finite alphabets: we consider only discrete learning
problems in this paper. The set of finite strings over $\Sigma$ is denoted by $\Sigma^*$ and
similarly for $\Gamma$. An element of $\Sigma^*$ is an example, and a concept is a set of
examples (a language over $\Sigma$). A representation is an element of $\Gamma^*$.

Definition 1 A representation system is a tuple $(R, \Gamma, c, \Sigma)$, where $R \subseteq \Gamma^*$ is
the set of representations, and $c : R \to 2^{\Sigma^*}$ maps representations to concepts,
the latter being languages over $\Sigma$.

Hence, given $R$ the mapping $c$ determines a concept class. For example, let
$\Gamma$ be the alphabet to express Boolean formulas, $\Sigma = \{0, 1\}$, and let $R$ be the
subset of disjunctive normal form (DNF) formulas. Let $c$ map each element
$r \in R$, say a DNF formula over $n$ variables, to $c(r) \subseteq \{0, 1\}^n$ such that every
example $e \in c(r)$ viewed as truth-value assignment makes $r$ "true". That is, if
$e = e_1 \ldots e_n$ and we assign "true" or "false" to the $i$th variable in $r$ according to
whether $e_i$ equals "0" or "1" then $r$ becomes "true". Each concept in the thus
defined concept class is the set of truth assignments that make a particular
DNF formula "true".

Definition 2 A pac-algorithm for a representation system R = $(R, \Gamma, c, \Sigma)$ is
a randomized algorithm $L$ such that, for every $s, n \ge 1$, $\epsilon > 0$, $\delta > 0$, $r \in R^{\le s}$,
and every probability distribution $D$ on $\Sigma^{\le n}$, if $L$ is given $s, n, \epsilon, \delta$ as input and
has access to an oracle providing examples of $c(r)$ (the concept represented by
$r$) according to $D$, then $L$, with probability at least $1 - \delta$, outputs a
representation $r' \in R$ approximating the target $r$ in the sense that $D(c(r')\,\Delta\,c(r)) \le \epsilon$.
Here, $\Delta$ denotes the symmetric set difference.

The acronym "pac" coined by Dana Angluin stands for "probably approximately
correct" which aptly captures the requirement the output representation
must satisfy according to the definition. The question of interest in
pac-learning is how many examples (and how much running time) a learning
algorithm needs to qualify as a pac-algorithm. The running time and number of
examples (sample complexity) of the pac-algorithm are expressed as functions
$t(n, s, \epsilon, \delta)$ and $m(n, s, \epsilon, \delta)$. The following definition generalizes the notion of
Occam algorithm in [3]:



Definition 3 An Occam-algorithm for a representation system R = $(R, \Gamma, c, \Sigma)$
is a randomized algorithm which for every $s, n \ge 1$, $\gamma > 0$, on input of a
sample consisting of $m$ examples of a fixed target $r \in R^{\le s}$, with probability at least
$1 - \gamma$ outputs a representation $r' \in R$ consistent with the sample, such that
$K(r' \mid r, n, s) \le m/f(m, n, s, \gamma)$, with $f(m, n, s, \gamma)$, the compression achieved,
being an increasing function of $m$.

The length-based version of (possibly randomized) Occam algorithm can be
obtained by replacing $K(r' \mid r, n, s)$ by $|r'|$ in this definition. The running time
of the Occam-algorithm is expressed as a function $t(m, n, s, \gamma)$, where $n$ is the
maximum length of the input examples.

Remark 1 An Occam algorithm satisfying a given $f$ achieves a lower bound
on the number $m$ of examples required in terms of $K(r' \mid r, n, s)$, the
Kolmogorov complexity of the output representation conditioned on the
target representation, rather than the (maximal) length $s$ of $r$ as in the original
Occam algorithm [3] and the length-based version above. This improvement
enables one to use information drawn from the hidden target for reduction of
the Kolmogorov complexity of the output representation, and hence further
reduction of the required sample complexity.

We need to show that the main properties of an Occam algorithm are preserved
under this generalization. Our first theorem is a Kolmogorov complexity based
Occam's Razor. We denote the minimum $m$ such that $f(m, n, s, \gamma) \ge x$ by
$f^{-1}(x, n, s, \gamma)$, where we set $f^{-1}(x, n, s, \gamma) = \infty$ if $f(m, n, s, \gamma) < x$ for every
$m$.

Theorem 1 Suppose we have an Occam-algorithm for R = $(R, \Gamma, c, \Sigma)$ with
compression $f(m, n, s, \gamma)$. Then there is a pac-learning algorithm for R with
sample complexity

$$m(n, s, \epsilon, \delta) = \max\left\{\frac{2}{\epsilon}\ln\frac{2}{\delta},\ f^{-1}\left(\frac{2\ln 2}{\epsilon}, n, s, \frac{\delta}{2}\right)\right\},$$

and running time $t_{\mathrm{pac}}(n, s, \epsilon, \delta) = t_{\mathrm{occam}}(m(n, s, \epsilon, \delta), n, s, \delta/2)$.

Proof. On input of $\epsilon, \delta, s, n$, the learning algorithm will take a sample of
length $m = m(n, s, \epsilon, \delta)$ from the oracle, then use the Occam algorithm with
$\gamma = \delta/2$ to find a hypothesis (with probability at least $1 - \delta/2$) consistent with
the sample and with low Kolmogorov complexity. In the proof we abbreviate
$f(m, n, s, \gamma)$ to $f(m)$ with the other parameters implicit. Learnability follows
in the standard manner from bounding (by the remaining $\delta/2$) the probability
that all $m$ examples of the target concept fall outside the symmetric difference
with a bad hypothesis, a set of probability $\epsilon$ or greater. Let $m = m(n, s, \epsilon, \delta)$.
Then $m \ge f^{-1}(\frac{2\ln 2}{\epsilon}, n, s, \gamma)$ gives

$$\epsilon - \frac{\ln 2}{f(m)} \ge \frac{\epsilon}{2},$$

and therefore $m \ge \frac{2}{\epsilon}\ln\frac{2}{\delta}$ gives

$$m\left(\epsilon - \frac{\ln 2}{f(m)}\right) \ge \ln\frac{2}{\delta}.$$

This implies (taking the exponent on both sides and using $1 - \epsilon \le e^{-\epsilon}$)

$$2^{m/f(m)}(1 - \epsilon)^m \le \delta/2.$$

The probability that some concept the Occam-algorithm can output has all
$m$ examples being bad is at most the number of concepts of complexity less
than $m/f(m)$, times $(1 - \epsilon)^m$, which by the above is at most $\delta/2$. □
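
The sample size of Theorem 1 can be computed for a concrete compression function. The sketch below is only an illustration: it assumes the sample complexity has the form $\max\{\frac{2}{\epsilon}\ln\frac{2}{\delta}, f^{-1}(\frac{2\ln 2}{\epsilon})\}$ with $n$, $s$, $\gamma$ held fixed, finds $f^{-1}$ by binary search over integers, and then checks the final inequality $2^{m/f(m)}(1-\epsilon)^m \le \delta/2$ of the proof. The compression $f(m) = \sqrt{m}/4$ and all function names are hypothetical.

```python
import math

def f_inverse(f, x, m_max=10**9):
    """Smallest integer m with f(m) >= x for an increasing f; inf if none up to m_max."""
    if f(m_max) < x:
        return math.inf
    lo, hi = 1, m_max
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid) >= x:
            hi = mid
        else:
            lo = mid + 1
    return lo

def pac_sample_size(f, eps, delta):
    """max{(2/eps) ln(2/delta), f^{-1}(2 ln 2 / eps)}, rounded up to an integer."""
    return max(math.ceil(2 / eps * math.log(2 / delta)),
               f_inverse(f, 2 * math.log(2) / eps))

def failure_bound(f, m, eps):
    """2^{m/f(m)} * (1 - eps)^m, computed in log space to avoid overflow."""
    return math.exp(m / f(m) * math.log(2) + m * math.log1p(-eps))

# Hypothetical compression f(m) = sqrt(m) / 4, i.e. alpha = 1/2 and p = 4.
f = lambda m: math.sqrt(m) / 4
eps, delta = 0.1, 0.05
m = pac_sample_size(f, eps, delta)
print(m, failure_bound(f, m, eps) <= delta / 2)
```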

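Corollary 1 When the compression is of the form

$$f(m, n, s, \gamma) = \frac{m^{1-\alpha}}{p(n, s, \gamma)},$$

one can achieve a sample complexity of

$$\max\left\{\frac{2}{\epsilon}\ln\frac{2}{\delta},\ \left(\frac{(2\ln 2)\,p(n, s, \delta/2)}{\epsilon}\right)^{1/(1-\alpha)}\right\}.$$

In the special case of total compression, where $\alpha = 0$, this further reduces to

$$\frac{2}{\epsilon}\max\left(\ln\frac{2}{\delta},\ (\ln 2)\,p(n, s, \delta/2)\right). \qquad (6)$$

For deterministic Occam-algorithms, we can furthermore replace $2/\delta$ and $\delta/2$
in Theorem 1 by $1/\delta$ and $\delta$ respectively.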

Remark 2 Essentially, our new Kolmogorov complexity condition is a com- 
putationally universal generalization of the length condition in the original 
Occam's razor theorem of [4]. Here, in Theorem 1, we consider the shortest 
description length over all effective representations, given the target
representation, rather than in a specific (syntactical) representation system. This
allows us to bound the required sample complexity not by a function of the
number of hypotheses (returned representations) of length at most the bound 
on the length of the target representation, but by a similar function of the num- 
ber of hypotheses that have a certain Kolmogorov complexity conditioned on 
the target concept, see Remark 1. Nonetheless, like in the original Occam's ra- 
zor Theorem of [4] , we return a representation of a concept approximating the 
target concept in the given representation system, rather than a representation 
outside the system like in Boosting approaches. 

Suppose we have a concept $c$ and a mis-classified example $x$, an exception.
Then, the symmetric difference $c\,\Delta\,\{x\}$ classifies $x$ correctly: if $x \notin c$ then
$c\,\Delta\,\{x\} = c \cup \{x\}$, and if $x \in c$ then $c\,\Delta\,\{x\} = c \setminus \{x\}$.

Definition 4 An exception handler for a representation system R = $(R, \Gamma, c, \Sigma)$
is an algorithm which on input of a representation $r \in R$ of length $s$, and an
$x \in \Sigma^*$ of length $n$, outputs a representation $r' \in R$ of the concept $c(r)\,\Delta\,\{x\}$,
of length at most $e(s, n)$, where $e$ is called the exception expansion function.
The running time of the exception-handler is expressed as a function $t(n, s)$ of
the representation and exception lengths. If $t(n, s)$ is polynomial in $n, s$, and
furthermore $e(s, n)$ is of the form $s + p(n)$ for some polynomial $p$, then we say
R is polynomially closed under exceptions.

Theorem 2 Let $L$ be a deterministic pac-algorithm with $m(n, s, \epsilon, \delta)$ the
sample size, and let $E$ be an exception handler for a representation system
R. Then there is an Occam algorithm for R that for $m$ examples achieves
compression $f(m, n, s, \gamma) = \frac{1}{2n\epsilon}$, where $\epsilon$, depending on $m, n, s, \gamma$, is such
that $m(n, s, \epsilon, \gamma) = \epsilon m$ holds. Moreover, $f > 1$ when $m > 2n\,m(n, s, \frac{1}{2n}, \gamma)$.

Proof. The proof is obtained in a fashion similar to [5]. Suppose we are given
a sample of length $m$ and confidence parameter $\gamma$. Assume without loss of
generality that the sample contains $m$ different examples. Define a uniform
distribution on these examples with $\mu(x) = 1/m$ for each $x$ in the sample.
Let $\epsilon$ be as described. The function $m(n, s, \epsilon, \gamma)$ decreases with increasing $\epsilon$,
while the function $\epsilon m$ increases with $\epsilon$ so the two necessarily intersect, under
the assumption in the theorem, for some $\epsilon_0$, although it may yield an $\epsilon > \frac{1}{2n}$,
giving no actual compression. For example, if $m(n, s, \epsilon, \gamma) = (1/\epsilon)^b$ for some
constant $b$, then $\epsilon_0 = m^{-1/(b+1)}$. Apply $L$ with $\delta = \gamma$ and $\epsilon = \epsilon_0$. With
probability $1 - \gamma$, it produces a concept which is correct with error $\epsilon$, giving up to
$\epsilon m$ exceptions. We can just add these one by one using the exception handler.
This will expand the concept size, but not the Kolmogorov complexity. The
resulting representation can be described by the $\le \epsilon m$ examples used plus
the $\le \epsilon m$ exceptions found. Since $L$ is deterministic, this uniquely determines
the required consistent concept. The compression achieved is $\frac{m}{2n\epsilon m} = \frac{1}{2n\epsilon}$. This
is an increasing function of $m$, since increasing the slope of the function $\epsilon m$
moves its intersection with the function $m(n, s, \epsilon, \gamma)$ to the left, that is, to
smaller $\epsilon$. □
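
The intersection point used in the proof can be located numerically. The following sketch bisects for the $\epsilon_0$ with $m(n, s, \epsilon_0, \gamma) = \epsilon_0 m$ and evaluates the compression $1/(2n\epsilon_0)$. It assumes the example sample-size function $(1/\epsilon)^b$ from the proof; the helper name and the values $b = 2$, $m = 10^6$, $n = 10$ are hypothetical.

```python
def intersection_eps(sample_size, m, tol=1e-12):
    """Bisect for eps in (0, 1] where the decreasing sample_size(eps) meets eps * m."""
    lo, hi = tol, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sample_size(mid) > mid * m:   # pac sample size still too large: move right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

b, m, n = 2, 10**6, 10
eps0 = intersection_eps(lambda e: (1 / e) ** b, m)
compression = 1 / (2 * n * eps0)
print(eps0, m ** (-1 / (b + 1)), compression)
```

Here the bisection agrees with the closed form $\epsilon_0 = m^{-1/(b+1)}$, and $\epsilon_0 < \frac{1}{2n}$, so the compression exceeds 1.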



Definition 5 Let R = $(R, \Gamma, c, \Sigma)$ be a representation system. The concept
$\mathrm{MAJ}(r_1, r_2, r_3)$ is the set of all $x$ that belong to at least two out of the three concepts
$c(r_1), c(r_2), c(r_3)$. A majority-of-three algorithm for R is an algorithm which
on input of three representations $r_1, r_2, r_3 \in R^{\le s}$, outputs a representation $r' \in R$
of the concept $\mathrm{MAJ}(r_1, r_2, r_3)$ of length at most $e(s)$, where $e$ is called the
majority expansion function. The running time of the algorithm is expressed
as a function $t(s)$ of the maximum representation length. If $t(s)$ and $e(s)$ are
polynomial in $s$ then we say R is polynomially closed under majority-of-three.

Theorem 3 Let $L$ be a deterministic pac-algorithm with sample complexity
$m(n, s, \epsilon, \delta) \in o(1/\epsilon^2)$, and let $M$ be a majority-of-three algorithm for the
representation system R. Then there is an Occam algorithm for R that for $m$
examples has compression $f(m, n, s, \gamma) = m/\left(3n\,m(n, s, \frac{1}{2\sqrt{m}}, \gamma/3)\right)$.

Proof. Let us be given a sample of length $m$. Take $\delta = \gamma/3$ and $\epsilon = \frac{1}{2\sqrt{m}}$.

Stage 1: Define a uniform distribution on the $m$ examples with $\mu_1(x) = 1/m$
for each $x$ in the sample. Apply the learning algorithm. It produces (with
probability at least $1 - \gamma/3$) a hypothesis $r_1$ which has error less than $\epsilon$, giving
up to $\epsilon m = \sqrt{m}/2$ exceptions. Denote this set of exceptions by $E_1$.

Stage 2: Define a new distribution $\mu_2(x) = \epsilon$ for each $x \in E_1$, and $\mu_2(x) =
(1 - |E_1|/2\sqrt{m})/(m - |E_1|)$ for each $x \notin E_1$. Apply the learning algorithm. It
produces (with probability at least $1 - \gamma/3$) a hypothesis $r_2$ which is correct
on all of $E_1$ and with error less than $\epsilon$ on the remaining examples. This gives
up to $\epsilon(m - |E_1|)/(1 - |E_1|/2\sqrt{m}) < \sqrt{m}$ exceptions. This set, denoted $E_2$, is
disjoint from $E_1$.

Stage 3: Define a new distribution on the $m$ examples with $\mu(x) = 1/|E_1 \cup
E_2| > \epsilon$ for each $x$ in $E_1 \cup E_2$, and $\mu(x) = 0$ elsewhere. Apply the
learning algorithm. The algorithm produces (with probability at least $1 - \gamma/3$) a
hypothesis $r_3$ which is correct on all of $E_1$ and $E_2$.

In total the number of examples consumed by the pac-algorithm is at most
$3m(n, s, \frac{1}{2\sqrt{m}}, \gamma/3)$, each requiring $n$ bits to describe. The three representations
are combined into one representation by the majority-of-three algorithm $M$.
This is necessarily correct on all of the $m$ examples, since the three
exception-sets are all disjoint. Furthermore, it can be described in terms of the
examples fed to the deterministic pac-algorithm and thus achieves compression
$f(m, n, s, \gamma) = m/\left(3n\,m(n, s, \frac{1}{2\sqrt{m}}, \gamma/3)\right)$. This is an increasing function of $m$
given the assumed subquadratic sample complexity. □
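
The arithmetic behind the three stages can be checked with a small sketch. It assumes $\epsilon = \frac{1}{2\sqrt{m}}$ and the Stage 2 weights $\epsilon$ on $E_1$ and $(1 - |E_1|/2\sqrt{m})/(m - |E_1|)$ elsewhere, then confirms that the weights sum to 1, that the bound on $|E_2|$ stays below $\sqrt{m}$, and that $1/|E_1 \cup E_2| > \epsilon$ in Stage 3. The function names and the sample size $m = 10000$ are hypothetical.

```python
import math

def stage2_weights(m, e1_size):
    """Return (eps, weight on each point of E1, weight on each other point)."""
    eps = 1 / (2 * math.sqrt(m))
    outside = (1 - e1_size * eps) / (m - e1_size)
    return eps, eps, outside

def stage2_exception_bound(m, e1_size):
    """Upper bound eps * (m - |E1|) / (1 - |E1| / (2 sqrt(m))) on |E2|."""
    eps = 1 / (2 * math.sqrt(m))
    return eps * (m - e1_size) / (1 - e1_size * eps)

m = 10_000
e1 = int(math.sqrt(m) / 2)                 # worst case |E1| = sqrt(m)/2
eps, w_in, w_out = stage2_weights(m, e1)
total = e1 * w_in + (m - e1) * w_out       # total probability mass of Stage 2
bound = stage2_exception_bound(m, e1)
print(total, bound, 1 / (e1 + bound) > eps)
```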

The following corollaries use the fact that if a representation system is
learnable, it must have finite VC-dimension and hence, according to (1), it is
learnable with sample complexity subquadratic in $1/\epsilon$.

Corollary 2 Let a representation system R be closed under either exceptions
or majority-of-three, or both. Then R is pac-learnable iff there is an Occam
algorithm for R.

Corollary 3 Let a representation system R be polynomially closed under
either exceptions or majority-of-three, or both. Then R is deterministically
polynomially pac-learnable iff there is a polynomial time Occam algorithm for R.

Example. Consider threshold circuits, acyclic circuits whose nodes compute
threshold functions of the form $a_1x_1 + a_2x_2 + \cdots + a_nx_n \ge \delta$, $x_i \in \{0, 1\}$, $a_j, \delta \in
\mathbb{N}$ (note that no expressive power is gained by allowing rational weights and
threshold). A simple way of representing circuits over the binary alphabet
is to number each node and use prefix-free encodings of these numbers. For
instance, encode $i$ as $1^{|\mathrm{bin}(i)|}0\,\mathrm{bin}(i)$, the binary representation of $i$ preceded
by its length in unary. A complete node encoding then consists of the encoded
index, encoded weights, threshold, encoded degree, and encoded indices of the
nodes corresponding to its inputs. A complete circuit can be encoded with a
node-count followed by a sequence of node-encodings. For this representation,
a majority-of-three algorithm is easily constructed that renumbers two of its
three input representations, and combines the three by adding a 3-input node
computing the majority function $x_1 + x_2 + x_3 \ge 2$. It is clear that under this
representation, the system of threshold circuits is polynomially closed under
majority-of-three. On the other hand they are not closed under exceptions, or
under the exception lists of [5].
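
The prefix-free code mentioned in this example, $1^{|\mathrm{bin}(i)|}0\,\mathrm{bin}(i)$, is easy to sketch. The Python below encodes a number and decodes a concatenation of codewords; the function names are hypothetical, and the full node and circuit encoding is not attempted.

```python
def encode(i):
    """Self-delimiting code: length of bin(i) in unary, a 0, then bin(i)."""
    b = bin(i)[2:]
    return "1" * len(b) + "0" + b

def decode_stream(bits):
    """Decode a concatenation of codewords back into the list of integers."""
    out, pos = [], 0
    while pos < len(bits):
        length = 0
        while bits[pos] == "1":      # read the unary length prefix
            length += 1
            pos += 1
        pos += 1                     # skip the separating 0
        out.append(int(bits[pos:pos + length], 2))
        pos += length
    return out

stream = "".join(encode(i) for i in [3, 0, 12, 7])
print(encode(5), decode_stream(stream))
```

Because no codeword is a prefix of another, node indices can be concatenated without separators and still be parsed unambiguously.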

Example. Let $h_1, h_2, h_3$ be three $k$-DNF formulas. Then $\mathrm{MAJ}(h_1, h_2, h_3) = (h_1 \wedge
h_2) \vee (h_2 \wedge h_3) \vee (h_3 \wedge h_1)$ which can be expanded into a $2k$-DNF formula.
This is not good enough for Theorem 3, but it allows us to conclude that
pac-learnability of $k$-DNF implies compression of $k$-DNF into $2k$-DNF.
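
The expansion of the majority of three $k$-DNF formulas into a $2k$-DNF formula can be sketched as follows. A DNF is modelled as a set of terms, each term a frozenset of literals `(variable, polarity)`; this data model, the function names, and the three sample formulas are assumptions made for illustration only.

```python
from itertools import product

def conj(f, g):
    """DNF of (f AND g): union every term of f with every term of g."""
    terms = set()
    for s, t in product(f, g):
        u = s | t
        # Drop terms containing both a variable and its negation.
        if not any((v, not pol) in u for v, pol in u):
            terms.add(u)
    return terms

def majority(h1, h2, h3):
    """(h1 AND h2) OR (h2 AND h3) OR (h3 AND h1), expanded to DNF."""
    return conj(h1, h2) | conj(h2, h3) | conj(h3, h1)

def evaluate(dnf, assignment):
    """True iff some term has all its literals satisfied by the assignment."""
    return any(all(assignment[v] == pol for v, pol in term) for term in dnf)

# Three hypothetical 2-DNF formulas over variables x0..x3.
h1 = {frozenset({(0, True), (1, True)}), frozenset({(2, False)})}
h2 = {frozenset({(1, True), (3, True)}), frozenset({(2, True)})}
h3 = {frozenset({(0, False)}), frozenset({(2, True), (3, True)})}
maj = majority(h1, h2, h3)
print(max(len(t) for t in maj))   # no term has more than 2k = 4 literals
```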



3 Applications 

Our KC-based Occam's razor theorem, Theorem 1, provides better sample
complexity than the length-based version and is as easy to use, as
demonstrated by the following two examples. While it is easy to construct an
artificial system with extremely bad representations such that our Theorem 1
gives arbitrarily better sample complexity than the length-based sample
complexity given in (5), we prefer to give natural examples.

Application 1: Learning a String.

The DNA sequencing process can be modeled as the problem of learning a
super-long string in the pac model [10,11]. We are interested in learning a
target string $t$ of length $s$, say $s = 3 \times 10^9$ (length of a human DNA sequence).
At each step, we can obtain as an example a substring of this sequence of
length $n$, from a random location of $t$ (Sanger's Procedure). At the time of
writing, $n \approx 500$, and sampling is very expensive. Formally, the concepts
we are learning are sets of possible length $n$ substrings of a superstring, and
these are naturally represented by the superstrings. We assume a minimal
target representation (which may not hold in practice). Suppose we obtain a
sample of $m$ substrings (all positive examples). In biological labs, a Greedy
algorithm which repeatedly merges a pair of substrings with maximum overlap
is routinely used. It is conjectured that Greedy produces a common superstring
$t'$ of length at most $2s$, where $s$ is the optimal length (NP-hard to find). In
[2], we have shown that $s \le |t'| \le 4s$. Assume that $|t'| \le 2s$.$^2$ Using the
length-based Occam's razor theorem, that is, Theorem 1 with $K(r' \mid r, s, n)$
in Definition 3 replaced by $|r'|$, this length of $2s$ would determine the sample
complexity, as in (6), with $p(n, s, \delta/2) = 2 \cdot 2s$ (the extra factor 2 is the
2-logarithm of the size of the alphabet $\{A, C, G, T\}$). Is this the best we can do?
It is well-known that the sampling process in DNA sequencing is a very costly
and slow process. We improve the sample complexity using our KC-based
Occam's razor theorem.
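
The Greedy procedure described above (repeatedly merging a pair of strings with maximum overlap) can be sketched in a few lines. This is a naive quadratic illustration rather than a tool used in practice; the function names and the short sample reads are hypothetical, and ties between equally good pairs are broken arbitrarily.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strings):
    """Repeatedly merge the pair with maximum overlap (input must be non-empty)."""
    pool = sorted(set(strings))
    # Strings contained in another string never need to be merged explicitly.
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]
    while len(pool) > 1:
        k, a, b = max(((overlap(a, b), a, b)
                       for a in pool for b in pool if a != b),
                      key=lambda item: item[0])
        merged = a + b[k:]
        # Remove a, b and anything else now covered by the merged string.
        pool = [s for s in pool if s not in merged] + [merged]
    return pool[0]

reads = ["GATTA", "TACAG", "AGATT", "TTTCA"]   # hypothetical length-5 reads
superstring = greedy_superstring(reads)
print(superstring, len(superstring))
```

The result is always a common superstring of the reads, although it need not be a shortest one.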

Lemma 1 Let $t$ be the target string of length $s$ and $t'$ be the superstring
returned by Greedy of length at most $2s$. Then

$$K(t' \mid t, s, n) \le 2s(2\log s + \log n)/n.$$

Proof. We give $t'$ a short description using some information from $t$. Let
$S = \{s_1, \ldots, s_m\}$ be the set of $m$ examples (substrings of $t$ of length $n$). Align
these substrings with the common superstring $t'$, from left to right. Divide
them into groups such that each group's leftmost string overlaps with every
string in the group but does not overlap with the leftmost string of the previous
group. Thus there are at most $2s/n$ such groups. To specify $t'$, we only need to
specify these $2s/n$ groups. After we obtain the superstring for each group, we
re-construct $t'$ by optimally merging the superstrings of neighboring groups.
To specify each group, we only need to specify the first and the last string
of the group and how they are merged. This is because every other string in
the group is a substring of the string obtained by properly merging the first
and last strings. Specifying the first and the last strings requires $2\log s$ bits of
information to indicate their locations in $t$ and we need another $\log n$ bits to
indicate how they are merged. Thus $K(t' \mid t, s, n) \le 2s(2\log s + \log n)/n$. □

2 Although only the $4s$ upper bound was proved in [2], which has since been
improved, it is widely believed that $2s$ is the true bound.

This lemma shows that (6) can also be applied with $p(n, s, \delta/2) = 2 \cdot 2s(2\log s +
\log n)/n$, giving a factor $n/(2\log s + \log n)$ improvement in sample complexity.
Note that in (mammal) genome computation practice, we have $n = 500$ and
$s = 3 \times 10^9$. The sample complexity using the Kolmogorov complexity-based
Occam's razor is reduced over the "length based" Occam's razor by a
multiplicative factor of $n/(2\log s + \log n) \approx \frac{500}{2 \times 31 + 9} \approx 7$.
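
The improvement factor can be recomputed directly. The short check below uses the values $n = 500$ and $s = 3 \times 10^9$ from the text with base-2 logarithms; the variable names are hypothetical.

```python
import math

n, s = 500, 3 * 10**9
bits_per_group = 2 * math.log2(s) + math.log2(n)   # 2 log s + log n
p_length = 2 * 2 * s                               # p for the length-based bound
p_kc = 2 * 2 * s * bits_per_group / n              # p obtained from Lemma 1
factor = p_length / p_kc                           # equals n / (2 log s + log n)
print(round(bits_per_group, 2), round(factor, 2))
```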

Application 2: Learning a Monomial. 

Consider the boolean space $\{0, 1\}^n$. There are two well-known algorithms for
learning monomials. One is the standard algorithm.

Standard Algorithm. 

(1) Initially set the concept representation $M := x_1\bar{x}_1 \ldots x_n\bar{x}_n$ (a conjunction
of all literals of $n$ variables, which contradicts every example).

(2) For each positive example, delete from the current M the literals that 
contradict the example. 

(3) Return the resulting monomial M. 
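
The three steps above can be sketched as follows. A literal is modelled as a pair `(i, True)` for $x_i$ and `(i, False)` for its negation, and examples are 0/1 tuples; this encoding, the function name, and the sample data are assumptions made for illustration.

```python
def standard_monomial(n, positive_examples):
    """Start with all 2n literals and delete those a positive example contradicts."""
    monomial = {(i, pol) for i in range(n) for pol in (True, False)}
    for e in positive_examples:
        # Literal x_i is contradicted when e[i] = 0, its negation when e[i] = 1.
        monomial -= {(i, not bool(e[i])) for i in range(n)}
    return monomial

# Hypothetical target x0 AND (NOT x2) over n = 4 variables.
positives = [(1, 0, 0, 1), (1, 1, 0, 0)]
learned = standard_monomial(4, positives)
print(sorted(learned))
```

The returned monomial always contains every literal of the target, since no positive example contradicts a target literal.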

Haussler [7] proposed a more sophisticated algorithm based on set-cover ap- 
proximation as follows. Let k be the number of variables in the target mono- 
mial, and m be the number of examples used. 

Haussler's Algorithm. 

(1) Use only negative examples. For each literal $x$, define $S_x$ to be the set of
negative examples such that $x$ falsifies these negative examples. The sets
associated with the literals in the target monomial form a set cover of
negative examples.

(2) Run the approximation algorithm of set cover, this will use at most
$k\log m$ sets or, equivalently, literals in our approximating monomial.
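
A minimal sketch of the set-cover step, using the usual greedy approximation, is given below. The candidate literal set is passed in explicitly; restricting it to literals consistent with the positive examples is an assumption of this sketch and is not spelled out in the description above. Literals are pairs `(i, True)` for $x_i$ and `(i, False)` for its negation, and all names and sample data are hypothetical.

```python
def greedy_cover_monomial(literals, negative_examples):
    """Pick literals greedily until every negative example is falsified."""
    # S_x: indices of the negative examples that literal x falsifies.
    covers = {(i, pol): {j for j, e in enumerate(negative_examples)
                         if bool(e[i]) != pol}
              for (i, pol) in literals}
    uncovered = set(range(len(negative_examples)))
    chosen = set()
    while uncovered:
        best = max(covers, key=lambda lit: len(covers[lit] & uncovered))
        if not covers[best] & uncovered:
            raise ValueError("no remaining literal falsifies the uncovered examples")
        chosen.add(best)
        uncovered -= covers[best]
    return chosen

# Hypothetical target x0 AND (NOT x2); candidate literals assumed given.
negatives = [(0, 0, 0, 0), (1, 0, 1, 0), (0, 1, 1, 1)]
candidates = {(0, True), (2, False)}
chosen = greedy_cover_monomial(candidates, negatives)
print(sorted(chosen))
```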

It is commonly believed that Haussler's algorithm has better sample
complexity than the standard algorithm.$^3$ We demonstrate that the opposite is
sometimes true (in fact for most cases), using our KC-based Occam's razor
theorem, Theorem 1. Assume that our target monomial $M$ is of length $n - \sqrt{n}$.
Then the length-based Occam's razor theorem gives sample complexity $n/\epsilon$
for both algorithms, by Formula 6. However, $K(M' \mid M) \le \sqrt{n}\log 3 + O(1)$,
where $M'$ is the monomial returned by the standard algorithm. This is true
since the standard algorithm always produces a monomial $M'$ that contains
all literals of the target monomial $M$, and we need at most $\sqrt{n}\log 3 + O(1)$
bits to specify, for each of the $\sqrt{n}$ variables that do not occur in $M$, whether it
occurs in $M'$ (positive or negative) or not. Thus our (6) gives the sample
complexity of $O(\sqrt{n}/\epsilon)$. In fact, as long as $|M| > n/\log n$ (which is most likely
to be the case if every monomial has equal probability), it makes sense to use
the standard algorithm.

3 In fact, Haussler's algorithm is specifically aimed at reducing sample complexity
for small target monomials, and that it does.
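
To get a feeling for the size of the gap, the sketch below evaluates Formula (6) twice for $n = 10000$: once with $p = n$ as a stand-in for the length-based bound (an assumption that ignores constant factors in the encoding length), and once with $p = \sqrt{n}\log 3$, dropping the $O(1)$ term. The function name and the values of $\epsilon$ and $\delta$ are hypothetical.

```python
import math

def sample_size_6(p, eps, delta):
    """Formula (6): (2/eps) * max(ln(2/delta), (ln 2) * p)."""
    return 2 / eps * max(math.log(2 / delta), math.log(2) * p)

n, eps, delta = 10_000, 0.1, 0.05
length_based = sample_size_6(n, eps, delta)                        # p taken as n bits
kc_based = sample_size_6(math.sqrt(n) * math.log2(3), eps, delta)  # p = sqrt(n) log 3
print(round(length_based), round(kc_based), round(length_based / kc_based, 1))
```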



4 Conclusions 

Several new problems are suggested by this work. If we have an algorithm
that, given a length-$m$ sample of a concept in Euclidean space, produces a
consistent hypothesis that can be described with only $m^{\alpha}$, $\alpha < 1$, symbols
(including a symbol for every real number; we are using an uncountable
representation alphabet), then it seems intuitively appealing that this implies some
form of learning. However, as noted in [5], the standard proof of Occam's
Razor does not apply, since we cannot enumerate these representations. The
main open question is under what conditions (specifically on the real number
computation model) such an implication would nevertheless hold.

Can we replace the exception element or majority-of-three requirement by some
weaker requirement? Or can we even eliminate such closure requirements and
obtain a complete reverse of Occam's razor theorem? Our current requirements
do not even include things like $k$-DNF and some other reasonable
representation systems.